(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(19) World Intellectual Property Organization
International Bureau

(43) International Publication Date: 25 May 2001 (25.05.2001)

(10) International Publication Number: WO 01/37097 A1

(51) International Patent Classification⁷: G06F 12/00, 7/36

(21) International Application Number: PCT/US00/31399

(22) International Filing Date: 15 November 2000 (15.11.2000)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data: 60/165,621, 15 November 1999 (15.11.1999), US

(71) Applicant (for all designated States except US): SMITHKLINE BEECHAM CORPORATION
[US/US]; One Franklin Plaza, Philadelphia, PA 19103 (US).

(72) Inventor; and
(75) Inventor/Applicant (for US only): VICTOR, Timothy, W. [US/US]; 1020 Riverwalk Drive,
Phoenixville, PA 19460 (US).

(74) Agents: KANAGY, James, M. et al.; SmithKline Beecham Corporation, Corporate
Intellectual Property, UW2220, 709 Swedeland Road, P.O. Box 1539, King of Prussia,
PA 19406-0939 (US).

(81) Designated States (national): AE, AL, AU, BA, BB, BG, BR, BZ, CA, CN, CZ, DZ, EE, GE,
GH, GM, HR, HU, ID, IL, IN, IS, JP, KP, KR, LC, LK, LR, LT, LV, MA, MG, MK, MN, MX, MZ,
NO, NZ, PL, RO, SG, SI, SK, SL, TR, TT, TZ, UA, US, UZ, VN, YU, ZA.

(84) Designated States (regional): ARIPO patent (GH, GM, KE, LS, MW, MZ, SD, SL, SZ, TZ,
UG, ZW), Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT, BE,
CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE, TR), OAPI patent (BF, BJ,
CF, CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).

Published:
- With international search report.
- Before the expiration of the time limit for amending the claims and to be republished in
the event of receipt of amendments.

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and
Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.

(54) Title: METHOD FOR IDENTIFYING UNIQUE ENTITIES IN DISPARATE DATA FILES

(57) Abstract: This invention relates to a method of matching computer-based records (301)
for identifying unique entities (303) both within and between disparate data files. This
method of record-linkage has particular utility in the fields of epidemiology and health
services research.






Method for Identifying Unique Entities in Disparate Data Files 
Field of the Invention 
This invention relates to a method of matching computer-based records for
identifying unique entities both within and between disparate data files. This method of
record-linkage has particular utility in the fields of epidemiology and health services
research.

Background of the Invention
A custom universal identifier methodology was developed in response to the
limitations of exact matching techniques. The methodology was designed to incorporate a
combination of exact and probabilistic matching techniques. The term record linkage has
been used to indicate the bringing together of two or more separately recorded pieces of
information concerning a particular entity. Integrating patient information from various
sources is essential for multivariate research. The various facts concerning an individual, if
brought together, form an extensive history of that individual.

There are many purposes for linking records. Examples range from obtaining more
data elements about an individual by merging data from different data sources, to creating a
more comprehensive name and address list by merging the names and addresses from several
data sources. In the first case, it is important to ensure that the matching is done accurately
so that the matched data truly represent a multivariate observation from a single individual.
In the second, the merging is intended to ensure as complete a list as possible while
eliminating duplication.

The idea of linking records in the interest of science has a long pedigree. Fisher
(Box, 1979, p. 237) lectured at a Zurich public health congress in 1929, arguing the
usefulness of public records supplemented by (and presumably linked with) family data in
human genetics research. Earlier, Alexander Graham Bell exploited genealogical records
and administrative records on marriages, census results and others, apparently linking some
sources, to sustain his familial studies of deafness (Bruce, 1973; Bell, 1906).

For many applications involving multiple databases, enough information is present
to allow an accurate human judgement about whether a record from one source refers to the
same case as a record from other sources. However, this is an extremely time-consuming,
error-prone, and unreliable method except for small data sets. Computer methods are
necessary to perform this task for a record matching exercise to be cost effective.

Summary of the Invention
The present invention is a computer-implemented system and method for creating a 
universal identifier for more than one record in one or more data files, the process 
comprising: 

standardizing one or more data elements in each record;

estimating the agreement and disagreement weights employed in the probabilistic 
function; and 

assigning a randomly generated unique identifier to each record. 
In a second aspect, this invention relates to a computer-implemented system and
method for concatenating records belonging to the same source within a data base or
between data bases, the process comprising:

(1) creating a universal identifier for each record in one or more data files, by:

a) standardizing one or more data elements in each record;

b) estimating the agreement and disagreement weights employed in
the probabilistic function; and

c) assigning a randomly generated unique identifier to each record;

and

(2) concatenating records having the same unique identifier.

In yet a third aspect, this invention relates to a computer-implemented system and
method for concatenating records belonging to the same source where some records have a
unique identifier and new records are created, the process comprising:

(1) creating a universal identifier for each new record in one or more data files,
by:

a) standardizing one or more data elements in each record;

b) estimating the agreement and disagreement weights employed in
the probabilistic function; and

c) assigning a randomly generated unique identifier to each record;

and

(2) concatenating records newly assigned a unique identifier with existing
records having the same unique identifier.

Description of the Figures

Figure 1 is a block diagram of illustrative input record components and atomic 
components. 

Figure 2 is a flowchart of weights calculated based on chance agreement using an
iterative bootstrap technique.

Figure 3 is a flowchart of the process for generating randomly assigned unique
identifiers.

Description of the Invention 

General Overview 

This invention provides a means for generating a unique identifier for records that
ultimately relate back to a single source. It is particularly useful where characterizing data
identifying that source expands or changes over time. Specific examples are financial data
and patient data. However, in both instances, data can normally be stored in a centralised
data file such as a central server only if it is adequately secured and anonymized. One way to
effect this security interest is to use a trusted third-party environment. This invention has its
greatest use in the trusted third-party environment.

A Trusted Third Party (TTP) service is a current way of anonymizing patient data.
The data is sent to a TTP, which takes the data and replaces all patient identifiers with a new
code. The TTP matches codes against the patients; it therefore knows all the codes and
patients.

Working within the purview of a TTP, or elsewhere, this invention addresses the step of
creating and assigning a unique identifier to a record, after which these records are
concatenated based on the unique identifier. The creation and assignment steps have three
major components: i) data standardization, ii) weight estimation, and iii) the assignment of a
unique identifier, in that order.
Definitions 

For the purposes of this invention, the following definitions and abbreviations are
used:

μ-Probability: The probability that any random element pair will match by chance



yivh 

p-Probability: The reliability of the data element. If the Element Error Rate (EER) is > .99 then
$p = 1 - \mathrm{EER}$; else $p = .99 - \mathrm{EER}$

Agreement: A condition such that a given element pair matches exactly and both elements
are known, $A = B$

Agreement Weight: The weight assigned to an element pair when they agree during the
record matching process




Cartesian Product: The set of ordered pairs $A \times B = \{(a, b) \mid a \in A \wedge b \in B\}$

Disagreement: A condition such that a given element pair does not exactly match and both 
elements are known 

Disagreement Weight: The weight assigned to an element pair when they disagree during
the record matching process.

Element Error Rate: The proportion of element pairs where at least one element is
unknown, e.g., null

n A . H 

Frequency Table: Summary of the number of times, and the percentage of the total, that
different values of a variable occur

Mean: Arithmetic average

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

No Decision: A condition in which one or both elements of a given element pair are
unknown.

Random Number Assignment: Every row in the data set will be assigned a random number
such that $\upsilon$ blocks of approximately 1500 are created: $p = \operatorname{int}[(\upsilon \times P) + 1]$, where $p$ =
Random Number, $\upsilon$ = Upper Bound and $P$ = Random Function.

Threshold: The threshold utilized in probabilistic matching is a binit odds ratio with a range
of $-\infty < x < \infty$.

Upper Bound: Number of strata such that the data set is divided into approximately equal
blocks of 1500 rows.

$$\upsilon = \operatorname{int}\left(\frac{\text{Number of Records in Data Set}}{1500}\right)$$

As regards the computer and machine language used in this process, just about any
piece of hardware capable of executing a fairly large number of calculations in short order
will suffice. Any current state-of-the-art PC or server could be used. As for the
operating system, UNIX is preferred, but Windows 98 or NT for Windows or the like could
be used. The source code can be written in any language, though Java is preferred.

Data Standardization 

The first step of this process involves the standardization of data in an input file.
This standardization is required for increased precision and reliability. The input file can
contain any number of variables, of which one or more are or may be unique to a particular
data source such as an individual. Examples of useful variables are: member identifier,
driver's license number, social security number, insurance company code number, name,
gender, date of birth, street address, city, state, postal code, citizenship. In addition, some
identifiers can be further distilled down into their basic, or atomic, components. Figure 1
illustrates the use of selected input record components and atomic components of some
records that are amenable to such further distillation. Referring to Figure 1, Input Record
100 illustrates data which can be used as the basis for assigning a unique identifier, and how
that data can be broken out into its atomic and subatomic components, exemplified by Street
Address 110, Date of Birth 120 and Name 130.

During the standardization process, all character data is preferably transformed to a
single case, for example uppercase. First names are likewise standardized to a single
canonical uppercase form, e.g., {BOB, ROB, ROBBY} = ROBERT. Common names for
cities and streets may be transformed to the postal code, e.g., in the U.S. to the United States
Postal Service standard. In the latter instance this can be done using industry-standard
CASS-certified software.
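
The standardization step can be illustrated with a minimal Python sketch. The field names (`first_name`, `last_name`, `date_of_birth`), the small nickname table, and the choice of year, month and day as the atomic components of a date of birth are assumptions made for illustration; the specification does not fix them, and address standardization with CASS-certified software is not shown.

```python
from datetime import date

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"BOB": "ROBERT", "ROB": "ROBERT", "ROBBY": "ROBERT"}


def standardize_text(raw):
    """Trim and uppercase character data; blank or missing values become None."""
    if raw is None or not raw.strip():
        return None
    return raw.strip().upper()


def standardize_first_name(raw):
    """Uppercase a first name and map known nicknames to one canonical form."""
    cleaned = standardize_text(raw)
    if cleaned is None:
        return None
    return NICKNAMES.get(cleaned, cleaned)


def split_date_of_birth(dob):
    """Break a date of birth into assumed atomic components."""
    if dob is None:
        return {"birth_year": None, "birth_month": None, "birth_day": None}
    return {"birth_year": dob.year, "birth_month": dob.month, "birth_day": dob.day}


def standardize_record(record):
    """Return a new dict holding the standardized and atomic fields."""
    out = {
        "first_name": standardize_first_name(record.get("first_name")),
        "last_name": standardize_text(record.get("last_name")),
    }
    out.update(split_date_of_birth(record.get("date_of_birth")))
    return out
```

Unknown values are kept as `None` rather than as empty strings so that later comparisons can treat them as unknown elements.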
Weight Estimation

A fundamental component of this algorithm is the process of estimating the
agreement and disagreement weights necessary for the probabilistic function. Weights are
calculated based on probabilities of chance agreement using an iterative bootstrap technique.
Figure 2 provides a flow of the process.

The first step in the weight estimation process is to determine the number of strata
required such that the data set can be divided into approximately equal blocks of 1500 rows
(Fig. 2 - 201-219), see equation 1.

$$\upsilon = \operatorname{int}\left(\frac{\text{Number of Records in Data Set}}{1500}\right) \tag{1}$$
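
As a rough illustration of equation 1 and of the Random Number Assignment definition, the sketch below computes the number of strata and draws a stratum number for every row. The function names, the `block_size` parameter, and the guard that returns at least one stratum for small data sets are assumptions added here, not part of the specification.

```python
import random


def upper_bound(num_records, block_size=1500):
    """Equation 1: number of strata, int(number of records / 1500)."""
    # Assumed guard: a data set smaller than one block still gets one stratum.
    return max(1, num_records // block_size)


def assign_strata(num_records, rng):
    """Give every row a random stratum number between 1 and the upper bound."""
    u = upper_bound(num_records)
    # int(u * P) + 1, with P a uniform random draw in [0, 1).
    return [int(u * rng.random()) + 1 for _ in range(num_records)]
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when checking the bootstrap estimates.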

The source file is then scanned and the records are assigned a random number between 1
and $\upsilon$. A data matrix is created containing a Cartesian product of records with a random
number of 1 assigned. The resulting matrix is then scanned. Each element pair within each
record pair is assessed and assigned a value in the following manner:

$$e_n = \begin{cases} 1 & \text{if } A_n = B_n \text{ (Agreement)} \\ 0 & \text{if } A_n = \text{Null and/or } B_n = \text{Null (No decision)} \\ -1 & \text{if } A_n \neq B_n \text{ (Disagreement)} \end{cases} \tag{2}$$

where $A_n$ is the nth element from record A.

Once the matrix has been fully assessed, percentages for each $e_n$ are tabulated and stored.
This process is repeated for 15 iterations.
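
The pairwise assessment of equation 2 and the tabulation for one iteration might look like the following sketch. Records are assumed to be dictionaries with `None` for unknown values, and skipping the pairing of a record with itself is an assumption, since the text does not say how such pairs are handled.

```python
from itertools import product


def compare_elements(a, b):
    """Equation 2: 1 = agreement, 0 = no decision, -1 = disagreement."""
    if a is None or b is None:
        return 0
    return 1 if a == b else -1


def tabulate_block(records, fields):
    """Proportion of agreements and no-decisions per field over all record pairs."""
    if len(records) < 2:
        raise ValueError("need at least two records to form a pair")
    counts = {f: {"agree": 0, "no_decision": 0} for f in fields}
    pairs = 0
    # Cartesian product of the block with itself; self-pairs are skipped (assumption).
    for (i, rec_a), (j, rec_b) in product(enumerate(records), repeat=2):
        if i == j:
            continue
        pairs += 1
        for f in fields:
            code = compare_elements(rec_a.get(f), rec_b.get(f))
            if code == 1:
                counts[f]["agree"] += 1
            elif code == 0:
                counts[f]["no_decision"] += 1
    return {f: {k: v / pairs for k, v in c.items()} for f, c in counts.items()}
```

Calling `tabulate_block` once per iteration on a freshly drawn block and storing each result gives the per-element figures that are averaged in the next step.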

Mean percentages of Agreements and No Decisions are calculated for each data
element (Fig. 2 - 221). The p probability, or the reliability, for each data element is then
calculated, see equation 3.

$$\text{let } e = \bar{x}_{\text{No Decision}}$$

$$p = \begin{cases} 1 - e & \text{if } e > .99 \\ .99 - e & \text{otherwise} \end{cases} \tag{3}$$

The μ probability, or the probability that element n for any given record pair will match by
chance, is calculated (Fig. 2 - 223), see equation 4.

$$\mu = \bar{x}_{\text{Percent Agreement}} \tag{4}$$
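
A short sketch of the two probabilities follows. It takes the rule for the reliability from the p-Probability definition (Element Error Rate above .99, else .99 minus the rate) and assumes that the chance-agreement probability is the mean agreement proportion across the stored iterations; the function names are illustrative.

```python
from statistics import mean


def reliability(element_error_rate):
    """p probability: 1 - EER when EER > .99, otherwise .99 - EER."""
    e = element_error_rate
    return 1 - e if e > 0.99 else 0.99 - e


def chance_agreement(agreement_by_iteration):
    """Chance-agreement probability: mean agreement proportion over iterations."""
    return mean(agreement_by_iteration)
```

The branch on .99 keeps the reliability positive even for an element that is almost always unknown.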

From the p and μ probabilities, the disagreement and agreement weight formulae are
calculated (Fig. 2 - 225) employing equations 5 and 6 respectively.

$$\text{Disagreement} = \log_2\left(\frac{1 - p}{1 - \mu}\right) \tag{5}$$

$$\text{Agreement} = \log_2\left(\frac{p}{\mu}\right) \tag{6}$$
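
The weight formulae are hard to read in the source, so the sketch below assumes the conventional record-linkage form: base-2 logarithms of the ratio of reliability to chance agreement, and of their complements for disagreement. Treat both the base and the exact form as assumptions.

```python
import math


def agreement_weight(p, mu):
    """Assumed form: log2 of reliability over chance agreement."""
    return math.log2(p / mu)


def disagreement_weight(p, mu):
    """Assumed form: log2 of (1 - reliability) over (1 - chance agreement)."""
    return math.log2((1 - p) / (1 - mu))
```

Under this form, an element that is reliable and rarely agrees by chance receives a large positive agreement weight and a negative disagreement weight.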

Unique Identifier Assignment

The final stage of this process is the action of uniquely identifying entities within
the input data set. Figure 3 provides an overview of this process.

Each record from the input file is evaluated against a reference file to determine if
the entity represented by the data has been previously identified using a combination of
deterministic and probabilistic matching techniques. If it is judged that the entity is already
represented in the reference set, the input record is assigned the unique identifier (UID)
from the reference record that it has matched against. If it is judged that the entity
represented by data is not yet in the reference set, a new UID is randomly generated and
assigned. Random numbers are generated using the facilities of whatever language the
process is implemented in.

After the UID assignment occurs, the input record is evaluated, in its entirety, to
determine if the record is a unique representation of the entity not already contained in the
reference table. If it is a new record, then it is inserted into the reference table for future use.
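
The assignment flow described above can be sketched as follows. The reference table is modelled as a plain list of dictionaries, `find_match` is a hypothetical callable standing in for the deterministic and probabilistic matching steps, and the use of `uuid.uuid4` as the random identifier is an assumption; the specification only requires a randomly generated UID.

```python
import uuid


def assign_uid(input_record, reference, find_match):
    """Reuse the UID of a matched reference record, else generate a new random one.

    find_match(input_record, reference) is a hypothetical matcher that returns
    the matched reference record or None.
    """
    match = find_match(input_record, reference)
    uid = match["uid"] if match is not None else uuid.uuid4().hex
    stored = dict(input_record, uid=uid)
    # Insert only when this exact representation is not already in the table.
    if stored not in reference:
        reference.append(stored)
    return uid
```

A record that matches an existing entity but carries new details (for example a changed address) is therefore added to the reference table under the existing UID.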
Deterministic Matching Technique 

The deterministic matching technique employs simple Boolean logic. Two records
are judged to match if certain criteria are met, such as the following:

First Name Matches Exactly
Last Name Matches Exactly
Date of Birth Matches Exactly
Social Security Number OR Member Identifier Matches Exactly

If two records satisfy the criteria for deterministic matching, no probabilistic
processing occurs. However, if no deterministic match occurs, the input record is presented
for a probabilistic match.
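
The example criteria listed above translate directly into a Boolean test. In this sketch the field names are illustrative, and requiring both values to be known before they can match is an assumption consistent with the definition of Agreement.

```python
def deterministic_match(a, b):
    """True when the example deterministic criteria are all satisfied."""

    def same(field):
        # Both values must be known and identical.
        return a.get(field) is not None and a.get(field) == b.get(field)

    return (
        same("first_name")
        and same("last_name")
        and same("date_of_birth")
        and (same("ssn") or same("member_id"))
    )
```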

Probabilistic Matching Technique 

The first step in the probabilistic matching process is to build a set of candidate
records from the reference table based on characteristics of specific elements of the input
record. This process is referred to as blocking; the set of candidate records is referred to as
the blocking table. Not all data sets use the same characteristics; the elements used in this
process are determined through data analysis. However, it is suggested that blocking
variables consist of those elements that are somewhat unique to an entity, e.g., social
security number, or a combination of date of birth and last name.
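
A blocking step using the suggested variables could be sketched as below. The reference table is again assumed to be a list of dictionaries, and the field names are illustrative; a production system would more likely query an indexed table than scan a list.

```python
def build_blocking_table(input_record, reference):
    """Candidates sharing the SSN, or both the date of birth and the last name."""

    def same(rec, field):
        value = input_record.get(field)
        return value is not None and rec.get(field) == value

    return [
        rec
        for rec in reference
        if same(rec, "ssn")
        or (same(rec, "date_of_birth") and same(rec, "last_name"))
    ]
```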

Upon completion of the construction of the blocking table, each element for each
candidate record is compared against its corresponding element from the input record. See
equation 7 for the scoring mechanism.

$$W_n = \begin{cases} \text{Agreement Weight} & \text{if } A_n = B_n \\ 0 & \text{if } A_n = \text{Null and/or } B_n = \text{Null} \\ \text{Disagreement Weight} & \text{if } A_n \neq B_n \end{cases} \tag{7}$$

where $A_n$ is the nth element from record A.

A composite weight is then calculated for all candidate records, see equation 8.

$$\text{Composite Weight} = \sum_{n} W_n \tag{8}$$

The candidate record with the highest composite weight is then evaluated against a
predefined threshold. If the weight meets or exceeds the threshold, the candidate record is
judged to match the input record. If the weight falls below the threshold, it is assumed
that the input record represents an entity not yet included in the reference set.
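
The scoring of equation 7, a composite weight taken as the sum of the element weights, and the threshold test can be sketched together. The `weights` mapping from field name to an (agreement weight, disagreement weight) pair is an assumed data layout, and the summation is the reading of equation 8 adopted here.

```python
def element_weight(a, b, agree_w, disagree_w):
    """Equation 7: agreement weight, 0 for unknowns, or disagreement weight."""
    if a is None or b is None:
        return 0.0
    return agree_w if a == b else disagree_w


def composite_weight(input_record, candidate, weights):
    """Sum of element weights over all compared fields."""
    return sum(
        element_weight(input_record.get(f), candidate.get(f), aw, dw)
        for f, (aw, dw) in weights.items()
    )


def best_match(input_record, candidates, weights, threshold):
    """Highest-scoring candidate if it meets the threshold, otherwise None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: composite_weight(input_record, c, weights))
    score = composite_weight(input_record, best, weights)
    return best if score >= threshold else None
```

Returning `None` signals that the input record is taken to represent an entity not yet in the reference set, so a new UID would be generated for it.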
What is claimed is:

1. A computer-implemented system and method for creating a universal identifier for
more than one record in one or more data files, the process comprising: 

standardizing one or more data elements in each record; 

estimating the agreement and disagreement weights employed in the probabilistic 
function; and 

assigning a randomly generated unique identifier to each record. 

2. A computer-implemented system and method for concatenating records belonging 
to the same source within a data base or between data bases, the process comprising: 

(A) creating a universal identifier for each record in one or more data files, by: 

a) standardizing one or more data elements in each record; 

b) estimating the agreement and disagreement weights employed in the
probabilistic function; and

c) assigning a randomly generated unique identifier to each record; and 

(B) concatenating records having the same unique identifier.

3. A computer-implemented system and method for concatenating records belonging 
to the same source where some records have a unique identifier and new records are created, 
the process comprising: 

(A) creating a universal identifier for each new record in one or more data files, 

by: 

a) standardizing one or more data elements in each record; 

b) estimating the agreement and disagreement weights employed in the 
probabilistic function; and 

c) assigning a randomly generated unique identifier to each record; and 

(B) concatenating records newly assigned a unique identifier with existing 
records having the same unique identifier. 

4. A method for assigning a unique identification number to a source or owner data as 
described herein.